PATRIXA: a unification-based parser for Basque and its application to the automatic analysis of verbs
Abstract
In this chapter we describe a computational grammar for Basque, and the first results obtained using it in the process of automatically acquiring subcategorization information about verbs and their associated sentence elements (arguments and adjuncts). The first part of this chapter (section 1) is devoted to a description of Basque syntax and a presentation of the grammar we have developed. The grammar is partial in the sense that it cannot recognize every sentence in real texts, but it is capable of describing the main syntactic elements, such as noun phrases (NPs), prepositional phrases (PPs), and subordinate and simple sentences. This can be useful for several applications. Next, the syntactic grammar is used by a syntactic analyzer (or parser) to automatically acquire information on verbal subcategorization from texts (section 2). The results can later be revised by a linguist or processed by statistical filters.

This work has been done by the IXA Natural Language Processing research group, whose work centers on the application of automatic methods to the analysis of Basque. Compared with other languages (English, German, French, ...), Basque can be considered a minority language due to the following constraints:

• A limited number of language users. This implies a reduced number of researchers and developers of computational linguistic tools.
• A limited number of language resources, in the form of computational lexicons, grammars, corpora, annotated treebanks or dictionaries.

These are the main reasons that have led the IXA group to develop automatic methods for the analysis of linguistic data. The work described in this chapter is part of this effort.

1 THE SYNTACTIC ANALYZER

1.1 A BRIEF INTRODUCTION TO COMPUTATIONAL SYNTAX

The computational treatment of syntax has long been an area of research.
Since the 1950s, when the first automatic translation systems were created, many researchers have studied the syntactic relationships among words and the way they combine to form sentences. However, the task has proved more difficult than expected: there is still no system capable of syntactically analyzing every sentence in real texts, such as newspapers. At the moment, the best syntactic analyzers have been developed for English, but even they face a formidable obstacle in the form of ambiguity, because many common sentences can produce tens or even hundreds of different syntactic analyses.

In this context, we can distinguish two approaches to computational syntax, according to their main objective:

• Full parsing. The aim is to construct ever more accurate and complete grammars and parsers, with the objective of syntactically analyzing any sentence. As we have noted earlier, the state of the art is still far from this objective.
• Partial parsing. In many systems the objective is not to analyze a sentence completely, but to detect certain syntactic elements, such as NPs, verb chains or simple sentences. These pieces of information, also called chunks (Abney 1997), are useful for several linguistic applications, such as information retrieval or speech synthesis.

Regarding the main kind of knowledge employed, we can classify syntactic analyzers into four groups:

• Unification-based analyzers (Shieber 1986). These systems are based on context-free grammars (Chomsky 1957), with information added to syntactic elements and rules by means of feature structures (see subsection 1.2).
• Finite-state analyzers (Karttunen et al. 1997). They are mainly dedicated to partial parsing, that is, they typically distinguish the different components of a sentence. Grammars are defined using regular expressions.
• Constraint Grammar (Karlsson 1995).
To analyze a sentence, this formalism starts with all the possible analyses of each individual word-form, and the task of the grammar is to discard as many options as possible until each word retains a single analysis that gives information about number, case, person and syntactic category. The formalism is called reductionistic because it starts from all the possibilities and ends only when the correct one is selected.
• Statistical methods. These systems automatically acquire syntactic information (in the form of context-free grammars or regular expressions) from large corpora. The information thus obtained is used to analyze new sentences. Usually, statistical methods are not used in isolation, but combined with other methods (Collins 1997).

The IXA natural language processing group has developed two syntactic analyzers for Basque, one using a unification-based formalism and another based on Constraint Grammar. Work on this second formalism is described in (Aduriz et al. 1997; Arriola 2000; Aduriz 2000; Aduriz and Arriola 2001). In this chapter we describe a unification grammar for Basque, together with its application to the task of automatically extracting verbal information from text corpora.

Regarding computational grammars and syntactic analyzers for languages other than Basque, we can cite the following:

• Natural Language Software Registry: http://registry.dfki.de
• Computational Linguistics (on-line presentations): http://www.ifi.unizh.ch/CL/InteractiveTools.html#as-h2-3296

Or, to experiment directly with a syntactic analyzer:

• Syntactic analyzer for English: http://www.conexor.fi
• Syntactic analyzer for Spanish (CliC): http://clic.fil.ub.es/equipo/index_en.shtml

1.2 UNIFICATION-BASED GRAMMAR FORMALISMS AND PATR

Unification-based grammar formalisms build on context-free grammars (CFGs). CFGs were formalized by Chomsky (1957), and they define a grammar as shown in Table 1.
English grammar       Basque grammar
S  → NP VP            S  → NP VP
VP → Verb NP          VP → NP Verb
NP → Noun             NP → Noun
NP → Det Noun         NP → Pronoun

Table 1. Two examples of context-free grammars.

Context-free rules are of the form ‘a → b’ or ‘a → b c’, where a is a non-terminal syntactic category and b, c are terminals (lexical elements) or non-terminals. Non-terminal symbols (S, NP, PP, ...) are syntactic categories, while terminals are words or morphemes from a lexicon. The chains of terminal symbols that can be derived from the first symbol (or axiom) of the grammar (S, or sentence, in the example) will be the sentences of the language. A sentence belonging to the language will typically be described by a tree. For example, Figure 1 shows an analysis tree of a sentence derived using the rules of the Basque grammar in Table 1.
Publication date: 2004